
Enable Cross layers KV cache layout at NIXL Connector #30207

Merged
NickLucche merged 62 commits into vllm-project:main from liranschour:nixl_kv_cont_cross_layers on Jan 22, 2026

Conversation

Contributor

@liranschour liranschour commented Dec 7, 2025

Purpose

Enable the NIXL Connector to use the new continuous cross-layer KV cache layout described in RFC #27742 and implemented in #27743.
Demonstrate a performance improvement of more than 2x in tok/s and TTFT, due to a dramatic reduction in the fragmentation of transfer buffers.

Tested with P != D using run_accuracy_test.sh (P=1, D=2).

Tested on H100 1P1D Qwen/Qwen3-0.6B:

| branch | num reqs | input len | TTFT | ITL | tok/s | Desc/transfer |
|---|---|---|---|---|---|---|
| main | 1000 | 16 | 18756.42 | 5.35 | 5288.41 | 56 |
| kv_cross_layers | 1000 | 16 | 8494.44 | 8.74 | 8572.56 | 1 |
| main | 128 | 1024 | 1660.84 | 9.40 | 37945.20 | 3528 |
| kv_cross_layers | 128 | 1024 | 686.98 | 9.26 | 55418.76 | 1 |
| main | 128 | 10240 | 11140.52 | 42.78 | 62339.74 | 34000 |
| kv_cross_layers | 128 | 10240 | 5226.71 | 14.41 | 117631.48 | 422 |
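A back-of-the-envelope model of the Desc/transfer column: with the per-layer layout every (layer, block) pair — and each K/V half, when the backend stores K and V separately — is its own contiguous region, while the cross-layer layout keeps all layers of a block contiguous, so only block boundaries fragment the transfer. The sketch below is illustrative only; the function and its numbers are invented, not taken from the vLLM source.

```python
def descriptors_per_request(num_layers: int, num_blocks: int,
                            split_k_and_v: bool, cross_layers: bool) -> int:
    """Toy count of contiguous regions needed to transfer one request's
    KV blocks, assuming the worst case of no adjacent free blocks."""
    if cross_layers:
        # All layers of a block are contiguous: one region per block.
        return num_blocks
    # Per-layer layout: one region per (layer, block) pair, doubled
    # when K and V live in separate tensors.
    per_block = 2 if split_k_and_v else 1
    return num_layers * num_blocks * per_block

# A hypothetical 28-layer model moving 16 non-contiguous blocks:
legacy = descriptors_per_request(28, 16, split_k_and_v=True, cross_layers=False)
contiguous = descriptors_per_request(28, 16, split_k_and_v=False, cross_layers=True)
```

Under this toy model the per-layer layout needs 896 descriptors where the cross-layer layout needs 16, which matches the direction (if not the exact magnitudes) of the benchmark table above.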

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables the use of a continuous cross-layer KV cache layout with the NIXL connector, which is a great performance optimization. The changes correctly signal the preference for this layout and add the necessary hooks to handle it.

However, I've found a critical issue in the implementation within NixlConnectorWorker.register_kv_caches. The existing logic does not correctly handle the shape of the cross-layer KV cache tensor, leading to incorrect calculation of num_blocks and num_layers. This will likely cause the KV cache transfer to fail for multi-layer models. I've added a detailed comment on this.

Once this issue is addressed, the PR should be in good shape.

mergify bot commented Dec 7, 2025

Hi @liranschour, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +359 to +363
cross_layer_name = "ALL_LAYERS"

kv_caches = {cross_layer_name: kv_cache}
self.connector_worker.cross_layers = True
self.connector_worker.register_kv_caches(kv_caches)

P1: Cross-layer caches break post-processing for heterogeneous blocks

The new cross-layer registration stores a single combined KV tensor under one key and sets cross_layers = True, but downstream post-processing still assumes split K/V caches whenever kv_topo.split_k_and_v is true. In the heterogeneous block-size path (block_size_ratio > 1) the code in blocksize_post_process/permute_device_kv indexes device_kv_caches as if each value were a (k,v) tuple, so with the cross-layer tensor it iterates over its first dimension and sample_cache = …[0] becomes a single block rather than a cache tensor. That path will raise shape errors or silently corrupt transfers as soon as heterogeneous block sizes are encountered, which are explicitly supported by this connector. The cross-layer registration should either propagate a non-split layout or adjust the post-processing to handle the combined tensor.
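The unpacking mismatch can be illustrated with plain shape tuples (all shapes and names below are hypothetical examples, not vLLM's actual layouts):

```python
def sample_shape(registered_value):
    """What taking element [0] of a registered cache value means
    under each registration style."""
    return registered_value[0]

# Split-K/V registration: each value is a (k_cache, v_cache) pair,
# so [0] yields the K cache, as the post-processing code expects.
split_value = ((1000, 16, 8, 64), (1000, 16, 8, 64))  # hypothetical shapes
assert sample_shape(split_value) == (1000, 16, 8, 64)

# Cross-layer registration: the value is one combined tensor of shape
# (num_blocks, num_layers, 2, block_size, heads, head_size), so indexing
# [0] on the real tensor slices off block 0 instead, dropping one rank.
cross_shape = (1000, 32, 2, 16, 8, 64)
sliced_rank = len(cross_shape) - 1  # rank of tensor[0]: a block, not a cache
```

Code that assumes the first form silently receives the second, which is exactly the shape-error / corruption hazard described above.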


Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: Liran Schour <lirans@il.ibm.com>
@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch from d75ca47 to 0f36888 on December 7, 2025 13:37
@mergify mergify bot added the v1 label Dec 8, 2025

mergify bot commented Dec 8, 2025

The pre-commit checks failed again; see the instructions above.

@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch from b3b5d45 to 0f36888 on December 8, 2025 14:34
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Collaborator

@orozery orozery left a comment

Looks good overall. Thanks!

Also, I wanted to share a summary I prepared on how this change affects the I/O size for different models:

| Model | Old I/O size | New I/O size |
|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B (tensor_parallel_size=2) | 16 KB | 2 MB |
| deepseek-ai/DeepSeek-V2-Lite-Chat (GPU block size=64) | 72 KB | 1.9 MB |
| meta-llama/Llama-3.1-8B-Instruct | 32 KB | 2 MB |
| meta-llama/Llama-3.2-1B-Instruct | 16 KB | 0.5 MB |
| meta-llama/Llama-3.1-70B-Instruct | 8 KB | 1.25 MB |
| mistralai/Mistral-7B-Instruct-v0.2 | 32 KB | 2 MB |
| mistralai/Mistral-Small-24B-Instruct-2501 | 32 KB | 2.5 MB |
| Qwen/Qwen2.5-3B-Instruct | 8 KB | 0.56 MB |
| Qwen/Qwen3-0.6B | 32 KB | 1.75 MB |
| Qwen/Qwen2.5-7B-Instruct | 16 KB | 0.87 MB |
| Qwen/Qwen3-4B-Instruct-2507 | 32 KB | 2.25 MB |
| Qwen/Qwen2.5-1.5B-Instruct | 8 KB | 0.44 MB |
| Qwen/Qwen3-8B | 28 KB | 1.97 MB |
| Qwen/Qwen3-1.7B | 32 KB | 1.75 MB |
| Qwen/Qwen3-32B (tensor_parallel_size=2) | 16 KB | 2 MB |

if vllm_config.kv_transfer_config is None:
raise ValueError("kv_transfer_config must be set for NixlConnector")
self.kv_transfer_config = vllm_config.kv_transfer_config
self.cross_layers = False
Collaborator

@orozery orozery Dec 10, 2025

I think we should move this variable into TpKVTopology.
Then TpKVTopology.split_k_and_v can return False in the cross-layers case.
This should also address the bug that Gemini pointed out in the current revision of this PR.

This change means we should defer the initialization of TpKVTopology from NixlConnectorWorker.__init__ to NixlConnectorWorker.register_kv_caches.
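The suggested refactor might look roughly like this minimal sketch (field names are invented for illustration; the real TpKVTopology has many more members):

```python
from dataclasses import dataclass


@dataclass
class TpKVTopology:
    backend_splits_k_and_v: bool  # what the attention backend would normally do
    cross_layers: bool            # a combined cross-layer tensor was registered

    @property
    def split_k_and_v(self) -> bool:
        # A cross-layer cache is one combined tensor, so K and V are
        # never registered as separate regions in that case.
        return self.backend_splits_k_and_v and not self.cross_layers
```

Folding the flag into the topology object keeps all layout decisions in one place, so paths like blocksize_post_process can consult split_k_and_v without knowing about cross-layer registration at all.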

Contributor Author

Done. Moved this logic under TpKVTopology.

Comment on lines 1219 to 1224
# TODO (NickLucche): Get kernel_block_size in a cleaner way
# NHD default "view" for non-MLA cache
if self.device_type == "cpu":
if self.device_type == "cpu" or self.cross_layers:
block_size_position = -2
else:
block_size_position = -2 if self.use_mla else -3
Collaborator

Following my suggestion to eliminate the self.cross_layers variable,
I suggest that we move this entire logic behind a new property, TpKVTopology.block_size_position.
cc @NickLucche
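A minimal sketch of the proposed block_size_position logic, written here as a standalone function with the same branch structure as the snippet above (on the real TpKVTopology it would be a property, and the thread below later moves device_type back out to the connector):

```python
def block_size_position(device_type: str, use_mla: bool,
                        cross_layers: bool) -> int:
    """Index of the block-size dimension in the KV cache tensor shape.

    CPU caches and cross-layer caches keep block_size second from the
    end; on GPU it depends on whether the model uses MLA.
    """
    if device_type == "cpu" or cross_layers:
        return -2
    return -2 if use_mla else -3
```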

Contributor Author

Done. Moved this logic under TpKVTopology.

# Attention backend and KV cache dtype affect memory layout
"attn_backend_name": attn_backend_name,
"cache_dtype": str(cache_config.cache_dtype),
"cross_layers": CROSS_LAYERS,
Collaborator

@orozery orozery Dec 10, 2025

An option to consider:
Defer the computation of self.compat_hash to NixlConnectorWorker.register_kv_caches.
This will allow you to conditionally set this new cross_layers compatibility constraint only if cross layers is actually being used.

Contributor Author

I deferred the compatibility computation to NixlConnectorWorker.register_kv_caches

Signed-off-by: Liran Schour <lirans@il.ibm.com>
@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch from 4bf547e to 2a20197 on December 11, 2025 08:33

mergify bot commented Dec 11, 2025

The pre-commit checks failed again; see the instructions above.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Collaborator

@NickLucche NickLucche left a comment

Thanks for the work @liranschour and @orozery !
Planning to review and run some benchmarks ASAP; in the meantime, can we run tp_config_sweep.sh to verify hetero-TP correctness with this layout? Thanks!

Also cc @xuechendi

@@ -74,6 +74,7 @@
# 2: Add remote_request_id to kv_transfer_params
#
NIXL_CONNECTOR_VERSION: int = 2
CROSS_LAYERS: bool = True
Collaborator

what's this for?

Contributor Author

I will remove this. It is a leftover from a previous version of the code.

Contributor Author

Removed @NickLucche

Contributor Author

@NickLucche I am able to run tp_config_sweep_accuracy_test.sh successfully only on Qwen/Qwen3-0.6B:

  • "GPU_MEMORY_UTILIZATION=0.6 PREFILLER_TP_SIZE=2 DECODER_TP_SIZE=2"
  • "GPU_MEMORY_UTILIZATION=0.6 PREFILLER_TP_SIZE=1 DECODER_TP_SIZE=2"

DeepSeek models are failing in my environment on the main branch as well, so I probably have some error in my environment.

Signed-off-by: Liran Schour <lirans@il.ibm.com>
@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch from 85d5cf0 to 073b30e on December 11, 2025 12:52

mergify bot commented Dec 11, 2025

The pre-commit checks failed again; see the instructions above.

Signed-off-by: Liran Schour <lirans@il.ibm.com>
@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch from 67d23fe to b403a9e on December 14, 2025 14:27
@liranschour liranschour requested a review from orozery December 14, 2025 14:29

mergify bot commented Dec 14, 2025

The pre-commit checks failed again; see the instructions above.

attn_backend: type[AttentionBackend]
engine_id: str
remote_block_size: dict[str, int]
cross_layers: bool
Collaborator

nit: maybe change to cross_layers_blocks / cross_layers_kv_cache or something else that is a bit more self-explanatory?
This also applies to the other cross_layers names introduced in this PR.

Contributor Author

Done

return self.remote_block_size[self.engine_id]

def block_size_position(self, device_type: str) -> int:
if device_type == "cpu" or self.cross_layers:
Collaborator

I think that device_type is nixl-connector specific. Can you move it out of here into the nixl-connector?

Contributor Author

Done

logger.debug("Detected attention backend %s", self.backend_name)
logger.debug("Detected kv cache layout %s", self.kv_cache_layout)

self.compat_hash = compute_nixl_compatibility_hash(
Collaborator

I think it's better to leave initialization to None here, so that all fields are declared in __init__.
See for example how it's done in gpu_model_runner.py:

        # Lazy initializations
        # self.model: nn.Module  # Set after load_model
        # Initialize in initialize_kv_cache
        self.kv_caches: list[torch.Tensor] = []
        # Initialize in initialize_kv_cache_tensors
        self.cross_layers_kv_cache: torch.Tensor | None = None
        self.cross_layers_attn_backend: type[AttentionBackend] | None = None

Contributor Author

Done

def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
"""Register the KV Cache data in nixl."""

backend = get_attn_backend(
Collaborator

nit: introduce self.attn_backend in __init__ and re-use it here?

Contributor Author

Done

is_mla=self.use_mla,
total_num_kv_heads=self.model_config.get_total_num_kv_heads(),
attn_backend=backend,
cross_layers=next(iter(kv_caches)) == "ALL_LAYERS",
Collaborator

There is a small edge-case if a model has a layer named ALL_LAYERS :)
In the offloading connector we detect a cross layers kv cache differently:

test_shape = attn_backend.get_kv_cache_shape(
    num_blocks=1234, block_size=16, num_kv_heads=8, head_size=256
)
has_layers_dim = False
if len(gpu_shape) != len(test_shape):
    # cross-layers tensor
    # shape is (num_blocks, ...)
    assert len(gpu_shape) == len(test_shape) + 1

Actually, maybe you can introduce TpKVTopology.tensor_shape initialized from the kv_caches tensor shape, and then set _cross_layers in TpKVTopology.__post_init__ ?
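The rank-comparison detection described above can be sketched as a standalone helper (the shapes in the checks are hypothetical examples, not taken from any particular backend):

```python
def is_cross_layers(gpu_shape: tuple[int, ...],
                    per_layer_shape: tuple[int, ...]) -> bool:
    """Detect a combined cross-layer KV cache by comparing the rank of
    the registered tensor against the backend's reference per-layer
    shape; one extra dimension means the num_layers axis was folded in.
    """
    if len(gpu_shape) == len(per_layer_shape):
        return False
    # Cross-layer tensors carry exactly one extra (num_layers) dimension.
    assert len(gpu_shape) == len(per_layer_shape) + 1
    return True
```

Unlike matching on a sentinel key such as "ALL_LAYERS", this check cannot be fooled by an attention layer that happens to share the sentinel's name.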

Contributor Author

Done

Comment on lines +1185 to +1187
self.enforce_compat_hash = self.kv_transfer_config.get_from_extra_config(
"enforce_handshake_compat", True
)
Collaborator

This can be moved back to __init__.

Contributor Author

Done

attn_backend=backend,
cross_layers=next(iter(kv_caches)) == "ALL_LAYERS",
)
self._use_pallas = self.kv_topo._use_pallas
Collaborator

We can either move this back to __init__ (setting it using attn_backend == AttentionBackendEnum.PALLAS),
or remove it altogether and switch self._use_pallas to self.kv_topo.use_pallas (a new property).

Contributor Author

Done

@liranschour liranschour force-pushed the nixl_kv_cont_cross_layers branch 4 times, most recently from ad45e7c to 02718ad on December 17, 2025 10:14
monajafi-amd pushed a commit to monajafi-amd/vllm that referenced this pull request Jan 23, 2026
…0207)

Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com>
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
…0207)

Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
…0207)

Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
@liranschour liranschour mentioned this pull request Jan 28, 2026
5 tasks
orozery added a commit to orozery/vllm that referenced this pull request Jan 28, 2026
orozery added a commit to orozery/vllm that referenced this pull request Jan 28, 2026
…roject#30207)"

This reverts commit 64e3d67.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
NickLucche pushed a commit that referenced this pull request Jan 28, 2026
…" (#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
khluu added a commit that referenced this pull request Jan 28, 2026
…" (#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: PiratePai <416932041@qq.com>
Signed-off-by: Pai <416932041@qq.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 15, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…0207)

Signed-off-by: Liran Schour <lirans@il.ibm.com>
Signed-off-by: liranschour <liranschour@users.noreply.github.com>
Co-authored-by: Or Ozeri <or@ozery.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 22, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 23, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 26, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

feat(otel): production-ready OpenTelemetry logging (streaming, SGLang-compatible format)

Squashed commits:
- feat: Production-ready OpenTelemetry logging with streaming support
- fix(otel): match SGLang logging format with proper kvlistValue structure

feat(openai): streaming improvements for completions API

- max_tokens: null support - remove asserts in serving_completion.py so
  null max_tokens uses computed max_model_len - prompt_length value

- UTF-8 streaming fix - hold back any delta containing U+FFFD replacement
  char in incremental detokenizer to prevent streaming corrupted multi-byte
  chars like "�️�️" for emojis

- gpt-oss special tokens - default skip_special_tokens=False for harmony
  models in chat completions when caller doesn't explicitly set it, so
  protocol framing tokens are preserved in output

Tested on /v1/completions stream=true endpoint:
- max_tokens:null streams successfully
- stream_options.include_usage returns usage chunks
- emoji/UTF-8 streaming produces clean output (no U+FFFD)
- skip_special_tokens:false accepted without error

feat(kimi): comprehensive Kimi K2 tool call streaming fixes

- Fix same-tool-multiple-times streaming using re.finditer for correct match indexing
- Fix previous_texts[i] accumulation for named tool_choice streaming
- Add Kimi K2 marker detection and extraction helpers
- Handle raw JSON output when reasoning parser returns it as reasoning content
- Add per-tool name tracking (tool_name_sent_arr) for parallel tool calls
- Strip all tool markers from leaked content after section ends
- Add finish-time handling for tools whose names weren't sent during streaming
- Handle string vs dict arguments in remaining args logic
- Update KIMI_K2_VLLM_CHANGES.md with comprehensive porting documentation

Test results:
- 15/15 parallel tool calls
- 10/10 concurrent streaming requests
- 121/121 full tool calling suite checks
- ~95% edge case tests (failures are model behavior, not bugs)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 27, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 28, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 28, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Mar 19, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

Labels

documentation (Improvements or additions to documentation), kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants